32 research outputs found

    A Hybrid Model for Sense Guessing of Chinese Unknown Words

    Get PDF
    PACLIC 23 / City University of Hong Kong / 3-5 December 2009

    Expanding Chinese sentiment dictionaries from large scale unlabeled corpus

    Get PDF

    SESS: A Self-Supervised and Syntax-Based Method for Sentiment Classification

    Get PDF
    PACLIC 23 / City University of Hong Kong / 3-5 December 2009

    Improving Cross-Domain Chinese Word Segmentation with Word Embeddings

    Full text link
    Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite recent progress in neural CWS. The limited amount of annotated data in the target domain has been the key obstacle to satisfactory performance. In this paper, we propose a semi-supervised, word-based approach to improving cross-domain CWS given a baseline segmenter. Notably, our model deploys only word embeddings trained on raw text in the target domain, discarding complex hand-crafted features and domain-specific dictionaries. Novel subsampling and negative sampling methods are proposed to derive word embeddings optimized for CWS. We conduct experiments on five datasets in special domains, covering novels, medicine, and patents. Results show that our model substantially improves cross-domain CWS, especially the segmentation of domain-specific noun entities: the word F-measure increases by over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and unsupervised cross-domain CWS approaches by a large margin. We make our code and data available on GitHub.
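
    The abstract does not spell out the CWS-specific subsampling and negative sampling, but the basic recipe of deriving embeddings from raw, baseline-segmented target-domain text can be sketched with gensim's standard Word2Vec options (`sample` for subsampling, `negative` for negative sampling). The file name and hyperparameters below are illustrative assumptions, not the authors' settings.

    ```python
    # Minimal sketch: train word embeddings on raw target-domain text with
    # subsampling and negative sampling. Uses gensim's standard Word2Vec
    # options; the paper's CWS-specific sampling variants are not reproduced.
    from gensim.models import Word2Vec

    # Each line of the raw corpus is assumed to be pre-segmented by the
    # baseline segmenter, with tokens separated by whitespace.
    with open("target_domain_raw.txt", encoding="utf-8") as f:
        sentences = [line.split() for line in f]

    model = Word2Vec(
        sentences,
        vector_size=100,   # embedding dimensionality
        window=5,          # context window size
        sample=1e-4,       # subsampling threshold for frequent words
        negative=10,       # number of negative samples per positive example
        min_count=2,       # ignore very rare tokens
        sg=1,              # skip-gram, a common choice for small domain corpora
    )

    # The resulting vectors would then feed the word-based segmenter.
    print(model.wv.most_similar(model.wv.index_to_key[0], topn=5))
    ```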

    Proposed clinical phases for the improvement of personalized treatment of checkpoint inhibitor–related pneumonitis

    Get PDF
    Background: Checkpoint inhibitor–related pneumonitis (CIP) is a lethal immune-related adverse event. However, the development process of CIP, which may provide insight into more effective management, has not been extensively examined. Methods: We conducted a multicenter retrospective analysis of 56 patients who developed CIP. Clinical characteristics, radiological features, histologic features, and laboratory tests were analyzed. After a comprehensive analysis, we proposed acute, subacute, and chronic phases of CIP and summarized each phase's characteristics. Results: There were 51 patients in the acute phase, 22 in the subacute phase, and 11 in the chronic phase. The median interval from the onset of CIP to each phase was calculated (acute phase: ≤4.9 weeks; subacute phase: 4.9–13.1 weeks; chronic phase: ≥13.1 weeks). Symptoms were relieved from the acute phase to the chronic phase, and the CIP grade and Performance Status score decreased (P<0.05). The main change in radiologic features was the absorption of the lesions; 3 of 11 patients in the chronic phase had persistent traction bronchiectasis. Regarding histologic features, most patients had acute fibrinous pneumonitis in the acute phase (5/8), and most had organizing pneumonia in the subacute phase (5/6). Other histologic changes advanced over time, with the lesions entering a state of fibrosis. Moreover, the levels of interleukin-6 (IL-6), interleukin-10 (IL-10), and high-sensitivity C-reactive protein (hsCRP) increased in the acute phase and decreased as CIP progressed (IL-6: 17.9 vs. 9.8 vs. 5.7, P=0.018; IL-10: 4.6 vs. 3.0 vs. 2.0, P=0.041; hsCRP: 88.2 vs. 19.4 vs. 14.4, P=0.005). Conclusions: The general development process of CIP can be divided into acute, subacute, and chronic phases, upon which a better management strategy might be devised.

    Word Segmentation for Chinese Novels

    No full text
    Word segmentation is a necessary first step for automatic syntactic analysis of Chinese text. Chinese segmentation is highly accurate on news data, but accuracy drops significantly on other domains, such as science and literature. For scientific domains, a significant portion of out-of-vocabulary words are domain-specific terms, and therefore lexicons can be used to improve segmentation significantly. For the literature domain, however, there is not a fixed set of domain terms. For example, each novel can contain a specific set of person, organization, and location names. We investigate a method for automatically mining common noun entities for each novel using information extraction techniques, and use the resulting entities to improve a state-of-the-art segmentation model for the novel. In particular, we design a novel double-propagation algorithm that mines noun entities together with common contextual patterns, and use them as plug-in features to a model trained on the source domain. An advantage of our method is that no retraining of the segmentation model is needed for each novel, and hence it can be applied efficiently given the huge number of novels on the web. Results on five different novels show significantly improved accuracies, in particular for OOV words.
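
    The abstract only names the double-propagation idea. A minimal sketch of such a loop, alternating between harvesting contextual patterns around known entities and harvesting new entities that fill known patterns, might look as follows; the one-token pattern shape, the seed set, and the toy sentence are assumptions for illustration, not the paper's algorithm.

    ```python
    # Minimal sketch of a double-propagation loop: starting from a few seed
    # entities, alternately mine the contexts that surround known entities
    # and the new entities that occur in known contexts.

    def double_propagation(tokens, seed_entities, iterations=3):
        entities = set(seed_entities)
        patterns = set()
        for _ in range(iterations):
            # Step 1: harvest contextual patterns around known entities.
            for i in range(1, len(tokens) - 1):
                if tokens[i] in entities:
                    patterns.add((tokens[i - 1], tokens[i + 1]))
            # Step 2: harvest new entities that fill a known pattern.
            for i in range(1, len(tokens) - 1):
                if (tokens[i - 1], tokens[i + 1]) in patterns:
                    entities.add(tokens[i])
        return entities, patterns

    # Toy usage on a whitespace-tokenized snippet: the shared context
    # (次日 _ 引) propagates entity status from the seed 曹操 to 刘备.
    toks = "次日 曹操 引 兵 出城 , 次日 刘备 引 兵 出城".split()
    ents, _ = double_propagation(toks, seed_entities={"曹操"})
    print(ents)  # {'曹操', '刘备'}
    ```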

    Formalization and Rules for Recognition of Satirical Irony

    No full text
    Satirical irony is a very important language phenomenon, and its recognition matters greatly for sentiment analysis. However, research on this topic is still quite rare, and existing studies suffer from problems such as unclear definitions and unclear objects of study. To address these problems, we first give clear definitions of satirical irony, then discuss at what level satirical irony occurs, and finally propose a set of features of satirical irony. Keywords: irony, satire, formalization.

    A Survey of Local Differential Privacy and Its Variants

    Full text link
    Local Differential Privacy (LDP) and its variants have become a cornerstone in addressing the privacy concerns raised by the vast data produced by smart devices, which form the foundation for data-driven decision-making in crowdsensing. While harnessing these immense data sets can offer valuable insights, it simultaneously poses significant privacy risks for the users involved. LDP, a distinguished privacy model with a decentralized architecture, stands out for its capability to offer robust privacy assurances for individual users during data collection and analysis. The essence of LDP is that each user's data is perturbed locally on the client side before transmission to the server, safeguarding against potential privacy breaches at both ends. This article offers an in-depth exploration of LDP, covering its models, its myriad variants, and the foundational structure of LDP algorithms.
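
    The client-side perturbation the survey describes can be illustrated with randomized response, the textbook ε-LDP mechanism for a single binary attribute. The sketch below is a standard example under that assumption, not a mechanism drawn from the article itself.

    ```python
    # Minimal sketch of randomized response, the classic epsilon-LDP mechanism
    # for one binary attribute. Each client keeps its true bit with a
    # probability calibrated to epsilon; the server debiases the aggregate.
    import math
    import random

    def perturb(bit, epsilon):
        """Client side: report the true bit with probability e^eps / (e^eps + 1)."""
        p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
        return bit if random.random() < p_truth else 1 - bit

    def estimate_frequency(reports, epsilon):
        """Server side: unbiased estimate of the true fraction of 1-bits."""
        p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
        observed = sum(reports) / len(reports)
        return (observed - (1.0 - p)) / (2.0 * p - 1.0)

    # Toy usage: 10,000 clients, 30% of whom truly hold a 1.
    epsilon = 1.0
    true_bits = [1 if random.random() < 0.3 else 0 for _ in range(10_000)]
    reports = [perturb(b, epsilon) for b in true_bits]
    print(round(estimate_frequency(reports, epsilon), 3))  # close to 0.3
    ```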

    Dependency Tree Representations of Predicate-Argument Structures

    No full text
    We present a novel annotation framework for representing predicate-argument structures, which uses dependency trees to encode the syntactic structure and semantic roles of a sentence simultaneously. The main contribution is a semantic role transmission model, which eliminates the structural gap between syntax and shallow semantics, making them compatible. A Chinese semantic treebank was built under the proposed framework, and the first release, containing about 14K sentences, is made freely available. The proposed framework enables semantic role labeling to be solved as a sequence labeling task, and experiments show that standard sequence labelers give competitive performance on the new treebank compared with state-of-the-art graph-structured models.
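
    As a toy illustration of how such a representation reduces semantic role labeling to sequence labeling, each token can carry a role tag relative to a marked predicate. The sentence, the A0/A1 tag inventory, and the BIO scheme below are assumptions for illustration, not the treebank's actual annotation.

    ```python
    # Minimal sketch: casting semantic role labeling as per-token sequence
    # labeling. Tokens are paired with role tags relative to one predicate.
    tokens = ["警方", "已经", "逮捕", "了", "嫌疑人"]
    labels = ["B-A0", "O", "rel", "O", "B-A1"]  # police = agent, suspect = patient

    # Any off-the-shelf sequence labeler (CRF, BiLSTM-CRF, ...) can then be
    # trained on (token, label) pairs, one sequence per predicate.
    for tok, lab in zip(tokens, labels):
        print(f"{tok}\t{lab}")
    ```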

    A method for automatic POS guessing of Chinese unknown words

    No full text
    This paper proposes a method for automatic POS (part-of-speech) guessing of Chinese unknown words. The method consists of two models. The first model uses machine learning to predict the POS of unknown words from their internal component features. The credibility of the first model's results is then measured, and for low-credibility words a second model revises the first model's results based on the global context information of those words. Experiments show that the first model achieves 93.40% precision for all words and 86.60% for disyllabic words, a significant improvement over the best previously reported results of 89% precision for all words and 74% for disyllabic words. The second model further improves precision by 0.80% for all words and 1.30% for disyllabic words. © 2008. Licensed under the Creative Commons.
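
    The two-model cascade can be sketched as a confidence-gated pipeline: keep the internal-feature prediction when its credibility is high, otherwise fall back to a context-based reviser. The stub classifiers, their features, and the threshold value below are illustrative assumptions, not the paper's models.

    ```python
    # Minimal sketch of the two-model cascade: a first model guesses the POS
    # of an unknown word from its internal components; a credibility threshold
    # decides whether a second, context-based model revises the guess.

    class InternalModel:
        """Stub: predicts POS from internal component features, e.g. the POS
        tendencies of a word's constituent characters."""
        def predict(self, word):
            # A real model would score all POS tags; here we fake a noun guess.
            return "NN", 0.65  # (tag, credibility)

    class ContextModel:
        """Stub: predicts POS from the global contexts the word appears in."""
        def predict(self, word, context):
            return "VV"  # a real model would use surrounding words' POS

    def guess_pos(word, context, internal_model, context_model, threshold=0.8):
        tag, credibility = internal_model.predict(word)
        if credibility >= threshold:
            return tag  # high credibility: keep the internal-feature prediction
        # Low credibility: revise using global context information.
        return context_model.predict(word, context)

    print(guess_pos("打拼", ["他", "在", "北京", "打拼", "多年"],
                    InternalModel(), ContextModel()))  # falls back to context
    ```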